{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pandas是一个非常好用的容器工具，构建在numpy之上，anaconda已经集成。  \n",
    "pandas在管理结构化数据上非常方便，底层是numpy，所以性能很强劲，而且在处理时间序列数据时非常方便。  \n",
    "jupyter对pandas的支持很好，可以直接显示pandas的数据结构，如DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv('test_data.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用pd.read_csv即可读取csv文件，返回一个pd.DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## head"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用data.head()就可以看到这个dataframe的前5行，可以使用data.head(10)看前10行"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "仔细观察的话，可以发现这个dataframe，最上面一行是列名，左侧第一列是行的序号。列名倒是列名，但其实第一列不是行的序号，它可以是任意值，比如\"a\",\"b\",\"c\",这样的字符串，只不过在这个数据集中，它恰好为0，1，2...。我们称最左侧的这一列为索引(index)，可以通过它对行进行存取。一会儿在后面会说存取的问题。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "data.info()可以看到整个数据的全貌，1460 not-null就表示有1460个非缺失值，object表示这列是字符串类型的，int64表示这列是int类型，float64表示这列是float类型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用data[列明]取一列，返回的是pd.Series，可以这样理解，Series组成DataFrame，DataFrame中的行和列都是Series，DataFrame是二维的，Series是一维的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data['SalePrice'].head(10) # Series也是支持head的"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 取数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "还有一种取多列的方式，将要取的列，写到一个list中"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "columns = ['LotFrontage', 'SalePrice']\n",
    "data[columns].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这样就可以截取多列"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用.iloc截取行，第一行就用data.iloc[0]，第i行就用data.iloc[i-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.iloc[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "返回的也是一个Series，Series类似字典，可以用左侧的index取到右面的值，比如"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.iloc[0]['SalePrice']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.iloc[0]['SaleType']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "同样的道理，也可以先取列，再取行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data['SalePrice'].iloc[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "刚才不是说左侧的0，1，2是索引吗，所以dataframe也是可以通过索引来取值的，使用data.loc[index]来取行。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到，第一行的索引为0，所以可以通过data.loc[0]取到第0行。假设第一行的索引为\"a\"，那我们就可以用data.loc[\"a\"]取到第一行。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.loc[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## drop"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用data.drop(列名, axis = 1, inplace = False)可以丢掉一列，inplace表示是否在原地操作，如果为True，就会直接删除data中的这列，不会有返回值，如果为False，就会返回一个删除到该列的新的DataFrame，不改变原来的DataFrame。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.drop('SalePrice', axis = 1, inplace = False).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果drop的axis = 0，就可以删除行"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## describe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用data.describe()可以看到关于这个dataframe的描述，count表示有多少个非缺失值，mean表示均值，std标准差，min最小值，25%是第一四分位数，50%第二四分位数，75%第三四分位数，max表示最大值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## sum, mean, std, max, min"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "和numpy类似，pandas也是支持这些方法的"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### test1 尝试使用这些方法\n",
    "请你输出data.sum(), data.mean(), data.std(), data.max(), data.min()的结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "\n",
    "data.sum()\n",
    "data.mean()\n",
    "data.std()\n",
    "data.max()\n",
    "data.min()\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 输出"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用data.to_csv(\"文件名\")就可以将当前的dataframe保存为.csv文件，同理还有to_excel, to_json, to_dict等方法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 用.values转化为numpy\n",
    "print(data.iloc[:5,:5].values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### test2 尝试保存文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "data.to_csv('foo.csv')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **这里介绍的只是Pandas强大功能的冰山一角，希望大家可以养成查阅官方文档的习惯**\n",
    "Pandas文档： https://pandas.pydata.org/pandas-docs/version/0.24/index.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
